Bayesian Model Averaging, Default Priors & Prediction Accuracy

1 Introduction

Bayesian model averaging (BMA), now widely accepted as a principled way of accounting for model uncertainty, requires the elicitation of priors for the parameters of each model and for the probability of each model. It is this elicitation of prior knowledge about the parameters of interest that often leads to more reliable estimates and smaller uncertainties. In some cases, the elicitation of prior knowledge is essential to obtaining meaningful results. If substantial prior information is available in the form of a probability distribution, it should be used. In many cases, however, prior information may be unavailable or negligible relative to the information provided by the data. In such cases, default priors can be used to characterize the prior probability distributions of model parameters. Still, inappropriate priors may unduly influence posterior-based inferences and decision making.

This paper examines the effect of default priors on Bayesian regression model selection, model size, posterior inclusion probabilities of regressors, and on predictive performance.

These issues are illustrated in the context of a linear regression model to predict audience scores for movies, given 645 observations and 16 regression parameters. Nine candidate default parameter priors were evaluated. BMA reference models, which best describe knowledge about future observations, served as proxies for the “true” data generating models. Candidate models were evaluated vis-a-vis the BMA reference models on the basis of predictive performance.

1.1 Background

The effect of default priors on BMA, model selection, and predictive performance has been studied in a range of disciplines, including econometrics, the social sciences, and biostatistics. Ley and Steel (Ley and Steel 2008) evaluated the predictive performance of several priors, including the Maruyama-George (Maruyama and George 2011), Bottolo-Richardson (Bottolo and Richardson 2010), Hyper-g, Hyper-g/n, and Zellner-Siow (A. Zellner and Siow 1980) priors. Combining Binomial-Beta priors on model sizes with g-priors on the coefficients of each model, they proposed a benchmark Beta prior as well as a hyper-g/n prior for econometric applications, specifically cross-country growth regression. Eicher, Papageorgiou, and Raftery (Eicher, Papageorgiou, and Raftery 2011) evaluated 12 priors which have been proposed in the statistics and economics literature and found that the Unit Information Prior (UIP), which corresponds to the Bayesian Information Criterion (BIC) approximation of the marginal likelihood, combined with the uniform prior over the model space, outperformed the 11 other priors. Liang et al. assessed the predictive performance of a range of g-priors and mixtures of g-priors using the highest probability models under each prior, rather than BMA (Liang et al. 2008). Based upon the reported mean squared error (MSE) of the predictions, they concluded that there were no statistically significant differences in the predictive performance of the mixtures vis-a-vis the fixed g-priors. Fernández et al. (Fernández, Ley, and Steel 2001) evaluated nine priors based upon Zellner’s g-prior structure (Arnold Zellner and Moulton 1985) and recommended a g-prior equal to \(1/k^2\) when \(n\leq k^2\) and \(1/n\) when \(n > k^2\), where \(n\) is the number of observations and \(k\) is the number of regressors.

The paper is organized as follows. Section 2 summarizes key BMA concepts, with a focus on prior parameter specification. Section 3 introduces the motivating example and the data used for this experiment. Section 4 presents a univariate and bivariate exploratory data analysis (EDA). Section 5 covers model prior comparison, evaluation, and selection. Section 6 reports the prediction accuracy of the top-performing models on unseen data. Finally, Section 7 offers concluding remarks.

BMA modeling and prediction functionality was provided by the BAS package (M. Clyde 2017). Core scripts and functions used in this study are included in the appendix. The complete source code used to produce this report is also available on GitHub at https://github.com/DataScienceSalon/Bayesian-Regression.


2 Theoretical Framework

2.1 Bayesian Model Averaging (BMA)

The objective of variable selection in linear regression is to identify a set of candidate predictors that produce the “best” model. Here, the “best” model may mean the model that best predicts unseen cases, or the model best supported by the data. Given a response variable \(Y\) and a set of \(k\) candidate predictors \(X_1, X_2, ..., X_k\), the “best” model is more concretely described as follows: \[Y = \beta_0 + \displaystyle\sum_{j=1}^{p}\beta_jX_j + \epsilon = X\beta + \epsilon\] where:
\(X_1, X_2, ..., X_p\) is a subset of \(X_1, X_2, ..., X_k\)
\(X\) is a \(n \times (p + 1)\) matrix containing the observed data on \(p\) predictors
\(\epsilon\) ~ \(N(0,\sigma^2I)\)
\(\beta\) is the vector of \((p + 1)\) unknown coefficients and \(\sigma^2\) is the unknown error variance
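As a concrete sketch of this setup, the least-squares fit of \(\beta\) can be computed directly; the design matrix, coefficients, and noise level below are hypothetical, chosen only to illustrate the notation:

```python
import numpy as np

# Hypothetical data: n = 6 observations on p = 2 predictors plus an intercept.
X = np.column_stack([
    np.ones(6),                      # intercept column (beta_0)
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # X1
    [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],  # X2
])
beta_true = np.array([1.0, 2.0, 0.5])  # the (p + 1) unknown coefficients
rng = np.random.default_rng(0)
y = X @ beta_true + rng.normal(0.0, 0.1, size=6)  # eps ~ N(0, sigma^2 I)

# Ordinary least-squares estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here `np.linalg.lstsq` returns the coefficient vector minimizing \(\lVert y - X\beta\rVert^2\).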

However, model selection exercises that lead to a single “best” model ignore model uncertainty (Hodges 1987; Draper 1995; Raftery, Madigan, and Hoeting 1997), and lead to underestimation of uncertainty when making inferences about quantities of interest (Raftery, Madigan, and Hoeting 1997).

Bayesian model averaging, employed when a variety of statistically reasonable models exist, addresses model uncertainty and leads to optimal predictive ability (A. Raftery, Madigan, and Hoeting 1994) by averaging over all possible combinations of predictors when making inferences about quantities of interest. The final estimates are computed as a weighted average of the parameter estimates from each of the models.

The standard BMA solution, first introduced by Leamer in 1978, defines the set of all possible models as \(M = ({M_1,..., M_K})\). If \(\Delta\) is the quantity of interest, such as a movie score prediction or tomorrow’s stock price, then the posterior distribution of \(\Delta\) given the data \(D\) is: \[Pr(\Delta|D) = \displaystyle\sum_{k=1}^KPr(\Delta|M_k, D)Pr(M_k|D)\] This is an average of the posterior distributions under each model, weighted by the corresponding posterior model probabilities (Raftery, Madigan, and Hoeting 1997). The posterior model probability of \(M_k\) is then computed as the ratio of its marginal likelihood to the sum of the marginal likelihoods over the entire model space, and is given by (Amini and Parmeter, n.d.):

\[Pr(M_k|D) = \frac{Pr(D|M_k)Pr(M_k)}{\displaystyle\sum_{i=1}^{K}Pr(D|M_i)Pr(M_i)}\] where:
\[Pr(D|M_k) = \int p(D|\theta_k, M_k)P(\theta_k|M_k)d\theta_k\] is the marginal likelihood of model \(M_k\), \(\theta_k\) is the vector of parameters of model \(M_k\), \(Pr(\theta_k|M_k)\) is the prior density of \(\theta_k\) under model \(M_k\), \(Pr(D|\theta_k, M_k)\) is the likelihood of the data given \(M_k\) and \(\theta_k\), and \(Pr(M_k)\) is the prior probability that \(M_k\) is the true model.

At this stage, the posterior inclusion probability of each candidate predictor \(\beta_p\), \(Pr(\beta_p\neq0|D)\), is obtained by summing the posterior model probabilities over all models that include \(\beta_p\). Referring back to the linear regression model, the posterior means and variances of the coefficient vector \(\beta\) are defined as: \[E[\beta|D] = \displaystyle\sum_{j=1}^{2^k}\hat{\beta}_jPr(M_j|D),\] \[V[\beta|D] = \displaystyle\sum_{j=1}^{2^k}\left(Var[\beta|D,M_j] + \hat\beta_j^2\right)Pr(M_j|D) - E[\beta|D]^2.\] Averaging over all models in this way provides better predictive results, as measured by the logarithmic scoring rule, than any single model \(M_j\) \((j = 1,...,K)\) (Raftery, Madigan, and Hoeting 1997).
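These weighted averages can be sketched numerically; the marginal likelihoods and per-model coefficient estimates below are hypothetical, chosen only to show the mechanics of the weighting:

```python
import numpy as np

# Hypothetical marginal likelihoods Pr(D | M_k) for K = 3 models,
# combined with a uniform model prior Pr(M_k) = 1/K.
marginal_lik = np.array([0.02, 0.05, 0.03])
prior = np.full(3, 1.0 / 3.0)

# Posterior model probabilities: Pr(M_k | D) is proportional to
# Pr(D | M_k) * Pr(M_k), normalized over the model space.
post = marginal_lik * prior
post /= post.sum()

# Hypothetical estimates of a single coefficient under each model
# (0.0 where the model excludes that predictor).
beta_hat = np.array([1.2, 0.9, 0.0])

# BMA posterior mean: sum over models of beta_hat_j * Pr(M_j | D).
beta_bma = float(post @ beta_hat)

# Posterior inclusion probability: total posterior mass of the models
# that include the predictor.
pip = float(post[beta_hat != 0.0].sum())
```

With these numbers the posterior weights are (0.2, 0.5, 0.3), so the averaged coefficient is 0.69 and the inclusion probability is 0.7.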

2.2 Prior Distributions of Parameters

To implement BMA, one must specify prior distributions over all parameters in all models, as well as prior probabilities for the models themselves. If prior information about the parameters and the models is available, it should be used. However, if the amount of prior information is small relative to the effort required to specify it, as is often the case, default, so-called “non-informative” or “reference”, priors may be used. The selection of default priors may affect the integrated likelihood, the key factor in computing posterior model weights: the prior density should be wide enough and reasonably flat over the region of the parameter space where the likelihood is large, but not so spread out that the prior density at the posterior mode is diminished, which decreases the integrated likelihood and may unnecessarily penalize larger models (A. Raftery, Madigan, and Hoeting 1994).

This experiment explores the following nine default priors supported in the literature (Table 1).

Table 1: Parameter priors
# Prior Comment Source
1 BIC Reference prior based upon log marginal likelihood estimated using Bayesian information criterion. Schwarz (1978)
2 AIC Reference prior based upon log marginal likelihood estimated using Akaike information criterion. Akaike (1973)
3 Empirical Bayes (Global) Prior Empirical Bayes method using a global estimate of g from the marginal likelihood of g. George and Foster (2000) and Clyde and George (2000)
4 Empirical Bayes (Local) Prior Empirical Bayes method using model-specific estimates of g. George and Foster (2000) and Clyde and George (2000)
5 g-prior Informative, conjugate normal prior conditioned on variance with prior mean and data dependent variance/covariance. Zellner (1986)
6 Hyper-g Class of g-priors based upon continuous proper hyperprior f(g), giving a mixture of g-priors where the prior on g/(1+g) is a Beta(1, alpha/2). Liang, Paulo, Molina, Clyde and Berger (2007)
7 Hyper-g/n A special case of the Hyper-g prior where u = g/n and u ~ Beta(1, alpha/2). Liang, Paulo, Molina, Clyde and Berger (2007)
8 Hyper-g Laplace Approximation A Laplace approximation to the Hyper-g priors Wood and Butler (2002)
9 Zellner-Siow Zellner-Siow priors represented as a mixture of g-priors with an Inv-Gamma(1/2, n/2) prior on g. Zellner and Siow (1980)

The Bayesian Information Criterion (BIC) prior is a reference prior under which the log marginal likelihood of the data given model \(M_k\) is approximated using the Bayesian information criterion:
\[\log Pr(D|M_k) \approx c - \frac{1}{2}BIC_k\] where:
\[BIC_k = n\log(1-R^2_k) + p_k\log(n)\]
The \(R^2_k\) and \(p_k\) terms are the coefficient of determination and the number of regressors for model \(M_k\), respectively, \(n\) is the number of observations, and \(c\) is a constant that does not vary across models and so cancels in the model averaging. This prior is typically flat where the likelihood is large and contains the same amount of information as would be contained in a typical single observation (Wasserman and Kass 1996). The Akaike information criterion (AIC) prior is the same as the BIC prior above, except that AIC is used to approximate the likelihood of the data given a model \(M_k\).

The remaining priors are based upon the multivariate normal-gamma conjugate prior, where the data are conditionally normal, \(\beta|\sigma^2 \sim N(\beta_0, g\sigma^2S^{-1}_{XX})\), and the scaled variances and covariances are obtained from ordinary least squares (OLS). This reduces prior elicitation to two components: the prior mean \(\beta_0\), taken to be 0, and the scalar \(g\). The Empirical Bayes Global prior (EB-global) uses an EM algorithm to find a common, global estimate of \(g\), averaged over all models (M. Clyde 2017). The Empirical Bayes Local prior (EB-local) uses the MLE of \(g\) from the marginal likelihood within each model (M. Clyde 2017). Zellner’s g-prior fixes \(g = n\). Hyper-g is a class of g-priors based upon a continuous proper hyperprior \(f(g)\), giving a mixture of g-priors in which the prior on \(g/(1+g)\) is Beta(1, \(\alpha/2\)) (M. Clyde 2017). Hyper-g/n is a special case of the Hyper-g prior in which \(u = g/n\) and \(u \sim\) Beta(1, \(\alpha/2\)), which provides consistency when the null model is true (M. Clyde 2017). Hyper-g Laplace is the same as the Hyper-g prior but uses a Laplace approximation to integrate over the prior on \(g\). Lastly, the Zellner-Siow prior places a gamma prior on \(n/g\) with shape 1/2 and scale 1/2.
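To illustrate how the BIC approximation feeds into the posterior model weights under a uniform model prior, a small sketch (the \(R^2\) values and model sizes are hypothetical; the constant \(c\) cancels in the normalization):

```python
import math

def bic(n, r2, p):
    """BIC_k = n*log(1 - R^2_k) + p_k*log(n)."""
    return n * math.log(1.0 - r2) + p * math.log(n)

# Hypothetical candidate models as (R^2_k, p_k) pairs, with n = 645 observations.
models = [(0.50, 3), (0.52, 5), (0.53, 8)]
n = 645

bics = [bic(n, r2, p) for r2, p in models]

# Posterior model weights: Pr(M_k | D) is proportional to exp(-BIC_k / 2)
# under a uniform model prior; subtracting the minimum BIC avoids underflow.
best = min(bics)
raw = [math.exp(-(b - best) / 2.0) for b in bics]
weights = [r / sum(raw) for r in raw]
```

With these numbers the five-regressor model attains the lowest BIC and hence the largest weight, illustrating BIC's trade-off between fit and complexity.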

2.3 Model Priors

For this experiment, the uniform distribution that assigns equal prior probability to all models was used such that \(Pr(M_k) = 1 / K\) for each \(k\) (A. E. Raftery 1988).


3 Data

The motivating example for this experiment is the prediction of audience scores for feature films based upon critics ratings, movie runtimes, MPAA rating, Oscar performance, box office returns, genre, and season of theatrical release (Table 2), with information obtained from the IMDb (Needham 1990), Rotten Tomatoes (Flixter 2017), and BoxOfficeMojo (Fritz 2008) websites.

Table 2: Movie Dataset Variables
Variable Type Subtype Source Description
audience_score Quantitative Continuous rottentomatoes.com The percentage of Rotten Tomatoes ratings that were 3.5 or greater out of 5 (the response variable)
feature_film Qualitative Dichotomous imdb.com “yes” if title_type is Feature Film, “no” otherwise
drama Qualitative Dichotomous imdb.com “yes” if genre is Drama, “no” otherwise
runtime Quantitative Continuous imdb.com film runtime in minutes
mpaa_rating_R Qualitative Dichotomous imdb.com “yes” if mpaa_rating is R, “no” otherwise
thtr_rel_year Qualitative Ordinal imdb.com Year of theatrical release
oscar_season Qualitative Dichotomous imdb.com “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise
summer_season Qualitative Dichotomous imdb.com “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise
imdb_rating Quantitative Continuous imdb.com Weighted average of votes (from 1 to 10) by IMDb registered users.
imdb_num_votes Quantitative Continuous imdb.com The number of votes placed by IMDb registered users.
critics_score Quantitative Continuous rottentomatoes.com The percent score or Tomatometer score from critics ratings
best_pic_nom Qualitative Dichotomous imdb.com “yes” if the film was nominated for Best Picture Oscar, “no” otherwise
best_pic_win Qualitative Dichotomous imdb.com “yes” if the film won Best Picture Oscar, “no” otherwise
best_actor_win Qualitative Dichotomous imdb.com “yes” if the film won Best Actor Oscar, “no” otherwise
best_actress_win Qualitative Dichotomous imdb.com “yes” if the film won Best Actress Oscar, “no” otherwise
best_dir_win Qualitative Dichotomous imdb.com “yes” if the film won Best Director Oscar, “no” otherwise
top200_box Qualitative Dichotomous boxofficemojo.com “yes” if the film was in the top 200 box office earners for the year of theatrical release, “no” otherwise

3.1 Data Sources

Launched in October 1990 by Col Needham, the Internet Movie Database (IMDb) is an online database of film information, audience and critics ratings, plot summaries, and reviews. As of November 2017, the site contained over 4.6 million titles and 8.2 million personalities, and hosted 80 million registered users (Needham 1990). RottenTomatoes.com, so named for the practice of audiences throwing rotten tomatoes at poor stage performances, was officially launched in April 2000 by Berkeley student Senh Duong. It provides audience and critics ratings to some 26 million users worldwide (Flixter 2017). Founded in 1999, BoxOfficeMojo tracks box office information and publishes the data on its website (Fritz 2008).

3.2 Generalizability & Causality

The data were randomly sampled via the IMDb and Rotten Tomatoes website APIs, so inferences should generalize to the population of films represented. Since this was an observational study in which no random assignment was performed, causality is not indicated by this analysis.


4 Exploratory Data Analysis

This exploratory data analysis (EDA) is comprised of three parts: (1) the univariate analysis which examines the frequencies and proportions of categorical variables and the centrality, variability, spread and shape of the distributions of quantitative variables, (2) the bivariate analysis which explores the relationships between the independent variables and audience scores, and (3) an association/correlation analysis which reveals any potential collinearity that may arise as a consequence of the relationships among the independent variables.

4.1 Univariate Analysis

4.1.1 Qualitative Analysis

4.1.1.1 Summary Variables

Figure 1 shows the counts and percentages for three summary variables: the drama, feature film, and MPAA R rating indicator variables. The drama genre constituted nearly half of the observations in the sample. A plurality of films were rated R, and nearly all films surveyed were indeed feature films.

Figure 1: Drama, features and R Rated Films

4.1.1.2 Film Performance Variables

Figure 2 summarizes the counts and proportions of films in the sample which have achieved notability at the box office or with the Academy. Best Picture winners, top 200 box office earners, and Best Picture nominees accounted for approximately one, two, and three percent of the films, respectively. Films earning Best Director, Best Actress, and Best Actor Oscars were slightly less rarefied at 7%, 11%, and 14% of the sample, respectively.

Figure 2: Oscar Awards and Top 200 Box Office Class

4.1.1.3 Theatrical Release Season

The Oscar season, starting in October and lasting until 31 December, marks the period in which Hollywood studios release their more critically acclaimed films. As indicated in Figure 4, approximately 30% of films in the data set were released during the Oscar season.

Figure 4: Oscar and Summer Season Releases

The summer season, which starts the first weekend of May and ends on Labour Day, accounts for a disproportionate share of Hollywood studios’ annual box office revenue. Some 32% of the feature films in the data set were released during the summer months.

4.1.1.4 Year of Theatrical Release

The data set contained 651 feature films released between 1970 and 2014. As presented in Figure 5, the number of films in the sample by year of release grew roughly linearly from a single film in 1970 to a peak of approximately 33 films in 2006 and 2007, then declined through 2014. The number of films per year centered at a mean of 14.66 and a median of 14.

Figure 5: Theatrical Releases by Year

4.1.2 Quantitative Analysis

4.1.2.1 Critics Score

Turning to the quantitative variables: the critics score, ranging from 1 to 100, was obtained from the Rotten Tomatoes website, and its summary statistics are presented in Table 3.

Table 3: Critics score summary statistics
N Min Q1 Median Mean Q3 IQR Max NA.s SD CV Kurtosis Skewness Outliers
645 1 33 61 57.6 83 50 100 0 28.4 49.3 -1.18 -0.27 0

The distribution of critics scores, represented in Figure 6 and further supported by Figure 7, departs rather substantively from normality. That said, Bayesian inference does not rely upon an assumption of normality in the distribution of predictors.

Figure 6: Critics score histogram and QQ Plot

Figure 7: Critics score box plot

Central Tendency: Table 3 reports that the central tendency for critics score was 61 points and 57.6 points for the median and mean, respectively.

Dispersion: The standard deviation, s = 28.4, corresponds with a coefficient of variation of 49.3%, indicating a moderate degree of dispersion.

Shape of Distribution: The sample skewness (-0.27) indicated that the distribution of critics score was approximately symmetric, while the sample kurtosis (-1.18) indicated that it was platykurtic, or light-tailed. The histogram and QQ plot in Figure 6 reveal a mildly left-skewed distribution that departs from normality.
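The skewness and kurtosis figures in these tables come from the sample moments; a minimal sketch using the uncorrected moment formulas (R summary packages often apply small-sample corrections, so values may differ slightly):

```python
import statistics

def shape_stats(x):
    """Moment-based skewness and excess kurtosis.

    Negative skewness indicates a left tail; negative excess kurtosis
    indicates a platykurtic (light-tailed) distribution.
    """
    n = len(x)
    m = statistics.fmean(x)
    m2 = sum((v - m) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in x) / n  # third central moment
    m4 = sum((v - m) ** 4 for v in x) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # 0 for a normal distribution
    return skew, excess_kurtosis
```

For a perfectly symmetric sample such as [1, 2, 3, 4, 5], the skewness is 0 and the excess kurtosis is negative (platykurtic).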

Outliers: The box plot in Figure 7, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 33, 83, and 50, respectively, yielding a 1.5xIQR ‘acceptable’ range of [-42, 158] and confirming the absence of outliers.
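The 1.5xIQR rule applied in each of these outlier checks can be sketched as follows (note that Python's `statistics.quantiles` uses a different quartile interpolation than R's default, so the cutoffs can differ slightly from the tables):

```python
import statistics

def iqr_outliers(x):
    """Return values falling outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(x, n=4)  # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in x if v < lo or v > hi]
```

For the critics score, with Q1 = 33 and Q3 = 83, the acceptable range is [-42, 158]; since scores lie in [1, 100], no observation can be flagged.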

4.1.2.2 IMDb Number of Votes

This variable, obtained from the IMDb website, represents the number of IMDb votes cast for each film.

Table 4: IMDb votes summary statistics
N Min Q1 Median Mean Q3 IQR Max NA.s SD CV Kurtosis Skewness Outliers
645 180 4821 15449 58029.3 58907 54086 893008 0 112526.1 193.9 19.96 4.04 68

Figure 8: IMDb votes histogram and QQ Plot

Figure 9: IMDb votes box plot

Central Tendency: The summary statistics (Table 4) show that the central tendency for imdb num votes was 15,449 votes and 58,029.3 votes for the median and mean, respectively.

Dispersion: The standard deviation, s = 112,526.11, corresponds with a coefficient of variation of 193.9%, indicating a very high degree of dispersion.

Shape of Distribution: The sample skewness (4.04), indicated that the distribution of imdb num votes was right-skewed. The sample kurtosis (19.96), indicated that the distribution of imdb num votes was leptokurtic or heavy-tailed. The histogram and QQ plot in Figure 8 reveal a distribution which departs significantly from normality.

Outliers: The box plot in Figure 9, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The 25th percentile, 75th percentile, and IQR were 4,821, 58,907, and 54,086, respectively, yielding a 1.5xIQR ‘acceptable’ range of [-76,308, 140,036] and confirming the existence of 68 outliers. A case-wise review of these influential points revealed no data quality errors, so they were retained for further analysis.

4.1.2.3 IMDb Number of Votes (Log)

This was a log transformation of the IMDb votes variable.

Table 5: Log IMDb votes summary statistics
N Min Q1 Median Mean Q3 IQR Max NA.s SD CV Kurtosis Skewness Outliers
645 7.5 12.2 13.9 14 15.8 3.6 19.8 0 2.41 17.2 -0.5 -0.01 0

Figure 10: Log IMDb votes histogram and QQ Plot

Figure 11: Log IMDb votes box plot

Central Tendency: The summary statistics (Table 5) report that the central tendency for imdb num votes (log) was 13.9 and 14 log votes for the median and mean, respectively.

Dispersion: The standard deviation, s = 2.41, corresponds with a coefficient of variation of 17.2%, indicating a low degree of dispersion.

Shape of Distribution: The sample skewness (-0.01), indicated that the distribution of imdb num votes (log) was approximately symmetric. The sample kurtosis (-0.5), indicated that the distribution of imdb num votes (log) was platykurtic or light-tailed. The histogram and QQ plot in Figure 10 reveal a nearly normal distribution.

Outliers: The box plot in Figure 11, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present. The 25th percentile, 75th percentile, and IQR were 12.2, 15.8, and 3.6, respectively, yielding a 1.5xIQR ‘acceptable’ range of [6.8, 21.2] and confirming the absence of outliers.

4.1.2.4 IMDb Ratings

This variable captured the IMDb rating for each film

Table 6: IMDb rating summary statistics
N Min Q1 Median Mean Q3 IQR Max NA.s SD CV Kurtosis Skewness Outliers
645 1.9 5.9 6.6 6.5 7.3 1.4 9 0 1.08 16.6 1.32 -0.9 19

Figure 12: IMDb rating histogram and QQ Plot

Figure 13: IMDb rating box plot

Central Tendency: The summary statistics (Table 6) show that the central tendency for imdb rating was 6.6 points and 6.5 points for the median and mean, respectively.

Dispersion: The standard deviation, s = 1.08, corresponds with a coefficient of variation of 16.6%, indicating a low degree of dispersion.

Shape of Distribution: The sample skewness (-0.9) indicated that the distribution of imdb rating was left-skewed, and the sample kurtosis (1.32) indicated that it was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 12 reveal a left-skewed but otherwise roughly normal distribution.

Outliers: The box plot in Figure 13, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The 25th percentile, 75th percentile, and IQR were 5.9, 7.3, and 1.4, respectively, yielding a 1.5xIQR ‘acceptable’ range of [3.8, 9.4] and confirming the existence of 19 outliers. A case-wise review of these influential points revealed no data quality errors, so they were retained for further analysis.

4.1.2.5 Runtime

This is an analysis of movie runtimes.

Table 7: Runtime summary statistics
N Min Q1 Median Mean Q3 IQR Max NA.s SD CV Kurtosis Skewness Outliers
645 39 92 103 105.9 116 24 267 0 19.49 18.4 8.99 1.75 17

Figure 14: Runtime histogram and QQ Plot

Figure 15: Runtime box plot

Central Tendency: The summary statistics (Table 7) show that the central tendency for runtime was 103 minutes and 105.9 minutes for the median and mean, respectively.

Dispersion: The standard deviation, s = 19.49, corresponds with a coefficient of variation of 18.4%, indicating a low degree of dispersion.

Shape of Distribution: The sample skewness (1.75) indicated that the distribution of runtime was right-skewed, and the sample kurtosis (8.99) indicated that it was leptokurtic, or heavy-tailed. The histogram and QQ plot in Figure 14 reveal a right-skewed distribution that departs from normality in the upper tail.

Outliers: The box plot in Figure 15, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that outliers were present. The 25th percentile, 75th percentile, and IQR were 92, 116, and 24, respectively, yielding a 1.5xIQR ‘acceptable’ range of [56, 152] and confirming the existence of 17 outliers. A case-wise review of these influential points revealed no data quality errors, so they were retained for further analysis.

4.1.2.6 Audience Score

Finally, the dependent variable, audience_score, is examined.

Table 8: Audience Score Summary Statistics
N Min Q1 Median Mean Q3 IQR Max NA.s SD CV Kurtosis Skewness Outliers
645 11 46 65 62.4 80 34 97 0 20.13 32.3 -0.9 -0.35 0

Figure 16: Audience Score Histogram and QQ Plot

Figure 17: Audience Score Box plot

Central Tendency: The summary statistics (Table 8) show that the central tendency for audience score was 65 points and 62.4 points for the median and mean, respectively.

Dispersion: The standard deviation, s = 20.13, corresponds with a coefficient of variation of 32.3%, indicating a moderate degree of dispersion.

Shape of Distribution: The sample skewness (-0.35) indicated that the distribution of audience score was approximately symmetric, and the sample kurtosis (-0.9) indicated that it was platykurtic, or light-tailed. The histogram and QQ plot in Figure 16 reveal a slightly left-skewed distribution that approximates normality.

Outliers: The box plot in Figure 17, which graphically depicts the median, the IQR, and the maximum and minimum values, suggested that no outliers were present, consistent with the count reported in Table 8.

4.2 Bivariate Analysis

Next, the relationships between the independent variables and audience scores are studied. The analysis continues with an exploration of the categorical variables vis-a-vis audience score, then an examination of the quantitative variables and the dependent variable.

4.2.1 Qualitative Analysis

4.2.1.1 Best Actor Oscar

The summary statistics in Table 9 evince very similar distributions of audience scores between films that won the Best Actor Oscar and those that did not.

Table 9: Audience Scores by Best Actor Oscar Win Summary Statistics
x Min Max Median Mean IQR SD N
no 13 96 65 62.2 34 20.2 552
yes 11 97 64 63.3 33 19.8 93

The box plot shown in Figure 18 supports an initial impression that best actor Oscar winnings have no statistically significant association with audience scores.

Figure 18: Audience Scores by Best Actor Oscar Win

A one-way ANOVA was conducted to compare the effect of a Best Actor Oscar on average audience scores. The effect was not significant at the p < .05 level [F(1,643) = 0, p = 0.654]. As such, the data do not support an association between the Best Actor Oscar award and audience score.
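The F statistic behind each of these one-way ANOVAs can be computed directly from the group scores; a minimal sketch (the groups in the example are hypothetical, not the movie data):

```python
import statistics

def one_way_f(groups):
    """One-way ANOVA F statistic across k groups:
    F = (SSB / (k - 1)) / (SSW / (N - k)),
    where SSB and SSW are the between- and within-group sums of squares.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: group sizes times squared mean deviations.
    ssb = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group mean.
    ssw = sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

# Hypothetical audience scores for two groups (e.g. winners vs non-winners).
f_stat = one_way_f([[62.0, 65.0, 60.0], [64.0, 63.0, 66.0]])
```

The p-value then follows from the F distribution with (k - 1, N - k) degrees of freedom, here (1, 643) for two groups and 645 films.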

4.2.1.2 Best Actress Oscar

Similarly, the summary statistics in Table 10 reveal almost identical distributions for audience score for both groups.

Table 10: Audience Scores by Best Actress Oscar Win Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 65 62.2 34.0 20.2 574
yes 14 97 69 63.7 30.5 19.5 71

Again, the box plot shown in Figure 19 graphically supports an assertion of little to no association between best actress Oscar winning and audience scores.

Figure 19: Audience Scores by Best Actress Oscar Win

A one-way ANOVA was conducted to compare the effect of a Best Actress Oscar on average audience scores. The effect was not significant at the p < .05 level [F(1,643) = 0, p = 0.556]. As such, the data do not support an association between the Best Actress Oscar award and audience scores.

4.2.1.3 Best Director Oscar

The summary statistics shown in Table 11 suggest that films which have been awarded the best director Oscar are also associated with higher average audience scores.

Table 11: Audience Scores by Best Director Oscar Win Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 65 61.9 33.75 20.1 602
yes 27 97 73 69.5 32.50 19.0 43

The box plot in Figure 20 reveals a slightly higher center for audience scores among those films with best director acclaim.

Figure 20: Audience Scores by Best Director Oscar Win

A one-way ANOVA was conducted to compare the effect of a Best Director Oscar on average audience scores. The effect was significant at the p < .05 level [F(1,643) = 6, p = 0.016]. Winning films tended to have slightly higher audience scores, and the data indicate a statistically significant difference.

4.2.1.4 Best Picture Nomination

The association between Oscar performance and audience scores appears for the first time with the Best Picture nomination. The summary statistics shown in Table 12 reveal a sizable difference in average and median audience scores between the films so nominated and those that were not.

Table 12: Audience Scores by Best Picture Nomination Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 64.0 61.6 33.5 20.0 623
yes 69 97 86.5 85.3 9.5 7.7 22

The box plot in Figure 21 supports the a priori hypothesis that Best Picture nominations are associated with higher audience scores.

Figure 21: Audience Scores by Best Picture Nomination

A one-way ANOVA was conducted to compare the effect of a Best Picture Oscar nomination on average audience scores. The effect was significant at the p < .05 level [F(1,643) = 31, p < .001]. Indeed, the data support an association between Best Picture nomination and audience scores.

4.2.1.5 Best Picture Oscar

As one might expect, given the prior results, the summary statistics shown in Table 13 suggest an association between Best Picture acclaim and higher audience scores.

Table 13: Audience Scores by Best Picture Oscar Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 65 62.1 33.0 20.1 638
yes 69 97 84 84.7 7.5 9.0 7

Similarly, the box plot in Figure 22 clarifies the potential association between Oscar performance and audience scores.

Figure 22: Audience Scores by Best Picture Oscar

A one-way ANOVA was conducted to compare the effect of a Best Picture Oscar on average audience scores. The effect was significant at the p < .05 level [F(1,643) = 9, p = 0.003]. The data support an association between the Best Picture award and audience scores.

4.2.1.6 Drama

The summary statistics in Table 14 report a slight difference in the central audience scores between dramas and non-drama films.

Table 14: Audience Scores by Genre Summary Statistics
x Min Max Median Mean IQR SD N
no 11 97 61 59.8 38 21.2 344
yes 13 95 70 65.3 28 18.4 301

The box plot in Figure 23 also shows a slight tendency towards higher audience scores for dramas, but is it significant?

Figure 23: Audience Scores by Genre

A one-way ANOVA was conducted to compare the effect of drama on average audience scores. The effect was significant at the p<.05 level [F(1,643) = 12, p < .001]. Dramas are indeed associated with slightly higher audience scores.

4.2.1.7 Feature Films

The summary statistics in Table 15 expose a rather sizable difference in audience scores between feature films and other film types. In fact, feature films appear to be associated with significantly lower average audience scores.

Table 15: Audience Scores by Film Type Summary Statistics
x Min Max Median Mean IQR SD N
no 68 96 86 83.5 11.0 7.5 54
yes 11 97 62 60.5 33.5 19.8 591

The box plot in Figure 24 visually characterizes the difference in the distribution of audience scores between the film types.

Figure 24: Audience Scores by Film Type

A one-way ANOVA was conducted to compare the effect of feature film status on average audience scores. The effect was significant at the p<.05 level [F(1,643) = 72, p < .001]. The data show that non-feature films are associated with higher audience scores.

4.2.1.8 Rated R

The summary statistics in Table 16 suggest very similar distributions of audience scores between rated R and other films.

Table 16: Audience Scores by MPAA Rating Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 65 62.8 31.50 20.2 319
yes 14 97 64 62.0 34.75 20.0 326

The box plot in Figure 25 shows a slightly lower central audience score for R rated films.

Figure 25: Audience Scores by MPAA Rating

A one-way ANOVA was conducted to compare the effect of an MPAA R rating on average audience scores. The effect was not significant at the p<.05 level [F(1,643) = 0, p = 0.646]. As such, the data do not support an association between audience scores and MPAA R ratings.

4.2.1.9 Oscar Season Release

Again, the summary statistics in Table 17 report nearly identical distributions of audience scores between films released during the Oscar season and those that were not.

Table 17: Audience Scores and Oscar Season Release Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 64 61.8 33 20.0 456
yes 13 97 69 63.9 33 20.3 189

The box plot in Figure 26 echoes the summary statistics.

Figure 26: Audience Scores and Oscar Season Release

A one-way ANOVA was conducted to compare the effect of an Oscar season release on average audience scores. The effect was not significant at the p<.05 level [F(1,643) = 1, p = 0.231]. As such, the data do not support an association between audience scores and Oscar season release dates.

4.2.1.10 Summer Season Release

Similarly, the summary statistics in Table 18 report nearly identical distributions of audience scores between films released during the summer season and those that were not.

Table 18: Audience Scores and Summer Season Release Summary Statistics
x Min Max Median Mean IQR SD N
no 11 96 64 61.8 33 20.0 456
yes 13 97 69 63.9 33 20.3 189

The box plot in Figure 27 is consistent with the summary statistics.

Figure 27: Audience Scores and Summer Season Release

A one-way ANOVA was conducted to compare the effect of a summer season release on average audience scores. The effect was not significant at the p<.05 level [F(1,643) = 1, p = 0.231]. The data therefore do not indicate an association between audience scores and summer release dates.

4.2.1.11 Top 200 Box Office

As one might conjecture, the top 200 box office films might be associated with higher audience scores, almost by definition. This is indicated by the summary statistics in Table 19, which show significantly higher central audience scores for the highest-grossing films.

Table 19: Audience Scores by Top 200 Box Office Summary Statistics
x Min Max Median Mean IQR SD N
no 11 97 65 62.1 33.0 20.1 630
yes 34 92 81 74.5 12.5 16.6 15

The box plot in Figure 28 illuminates this difference.

Figure 28: Audience Scores and Top 200 Box Office

A one-way ANOVA was conducted to compare the effect of top 200 box office status on average audience scores. The effect was significant at the p<.05 level [F(1,643) = 6, p = 0.018]. The data thus support the assertion that the highest-grossing films are more popular from an audience score perspective.

4.2.2 Quantitative Analysis

4.2.2.1 Critics Score

The scatter plot (Figure 29) indicates a positive correlation between critics score and audience score.

Figure 29: Audience Score and Critics Score

A Pearson product-moment correlation coefficient was computed to assess the relationship between critics score and audience scores. There was a positive correlation between the two variables, r = 0.703, n = 643, p < .001. As supported by the scatterplot, a strong positive correlation between critics score and audience score was observed.
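The Pearson product-moment correlation used throughout this subsection can be sketched as follows (hypothetical score vectors, not the movie data):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical paired critics and audience scores.
r = pearson_r([75, 98, 85, 46, 26], [78, 79, 87, 34, 60])
```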

4.2.2.2 IMDB Num Votes

The scatter plot (Figure 30) indicates a weak positive correlation between the number of IMDb votes and audience score.

Figure 30: Audience Score and IMDB Num Votes

A Pearson product-moment correlation coefficient was computed to assess the relationship between IMDb number of votes and audience scores. There was a positive correlation between the two variables, r = 0.292, n = 643, p < .001. As supported by the scatterplot, a weak positive correlation between IMDb number of votes and audience score was observed.

4.2.2.3 IMDB Num Votes (Log)

The scatter plot (Figure 31) reveals a slight positive correlation between the log of IMDB number of votes and audience score.

Figure 31: Audience Score and IMDB Number of Votes (Log)

A Pearson product-moment correlation coefficient was computed to assess the relationship between IMDb number of votes (log) and audience scores. There was a positive correlation between the two variables, r = 0.185, n = 643, p < .001. As supported by the scatterplot, a weak positive correlation between IMDb number of votes (log) and audience score was observed.

4.2.2.4 IMDB Rating

The scatter plot (Figure 32) suggests a strong positive correlation between IMDB rating and audience score.

Figure 32: Audience Score and IMDB Rating

A Pearson product-moment correlation coefficient was computed to assess the relationship between IMDb rating and audience scores. There was a positive correlation between the two variables, r = 0.862, n = 643, p < .001. As supported by the scatterplot, a strong positive correlation between IMDb rating and audience score was observed.

4.2.2.5 Runtime

The scatter plot (Figure 33) suggests a weak positive correlation between runtime and audience score.

Figure 33: Audience Score and Runtime

A Pearson product-moment correlation coefficient was computed to assess the relationship between runtime and audience scores. There was a positive correlation between the two variables, r = 0.176, n = 643, p < .001. As supported by the scatterplot, a weak positive correlation between runtime and audience score was observed.

4.2.2.6 Year of Theatrical Release

The scatter plot (Figure 34) suggests the lack of a correlation between the year of theatrical release and audience score.

Figure 34: Audience Score and Year of Theatrical Release

A Pearson product-moment correlation coefficient was computed to assess the relationship between year of theatrical release and audience scores. There was a weak negative correlation between the two variables, r = -0.052, n = 643, p = 0.185, which was not statistically significant. As supported by the scatterplot, no meaningful correlation between year of theatrical release and audience score was observed.

4.3 Association and Correlation

Next, the significance and strength of associations between categorical variables were examined. Pairwise chi-squared tests of independence were conducted to assess the significance (p.value) and the strength, Cramer’s V (Hoel, Wolfowitz, and Cramer 1947), of each association. The chi-squared results summarized in Table 20 reveal several associations that could present collinearity issues for regression. Focusing on the regressors most highly associated with audience score, the degree of association among Academy-awarded films was significant. There was also a strong association between films that won Best Picture and those that were nominated. Both significant and strong associations were observed between theatrical releases in the Oscar and summer seasons.
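The pairwise tests combine a chi-squared statistic with Cramer’s V, which rescales chi-squared by the sample size and table dimensions. A minimal sketch over a hypothetical 2x2 contingency table (the counts are invented, not taken from the movie data):

```python
from math import sqrt

def chi2_cramers_v(table):
    """Chi-squared statistic and Cramer's V for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, sqrt(chi2 / (n * k))

# Hypothetical 2x2 table, e.g. nomination status vs. win status.
chi2, v = chi2_cramers_v([[20, 5], [2, 18]])
```

A perfectly associated table yields V = 1; independence yields V near 0.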

Table 20: Chi-squared test of independence between categorical variables. Significant (\(\alpha = 0.05\)) differences between observed and expected differences under the null hypothesis are highlighted in bold text.
Terms Feature Film Drama R Rating Year Oscar Season Summer Season Best Pic Nom Best Pic Win Best Actor Best Actress Best Director Top 200
Feature Film NA 0 0 0.001 0.468 0.243 0.293 0.906 0.032 0.044 0.077 0.476
Drama 0 NA 0 0.165 0.206 0.124 0.007 0.347 0.013 0.001 0.161 0.793
R Rating 0 0 NA 0.096 1 0.72 1 1 0.91 0.923 0.234 0.033
Year 0.001 0.165 0.096 NA 0.175 0.842 0.637 0.699 0.572 0.755 0.069 0.497
Oscar Season 0.468 0.206 1 0.175 NA 0 0 0.226 0.001 0.064 0.041 0.075
Summer Season 0.243 0.124 0.72 0.842 0 NA 0.098 1 0.045 0.729 0.813 0.701
Best Pic Nom 0.293 0.007 1 0.637 0 0.098 NA 0 0 0 0 0.155
Best Pic Win 0.906 0.347 1 0.699 0.226 1 0 NA 0.596 0.001 0 0.395
Best Actor 0.032 0.013 0.91 0.572 0.001 0.045 0 0.596 NA 0.001 0.017 0.32
Best Actress 0.044 0.001 0.923 0.755 0.064 0.729 0 0.001 0.001 NA 0.057 0.123
Best Director 0.077 0.161 0.234 0.069 0.041 0.813 0 0 0.017 0.057 NA 0.601
Top 200 0.476 0.793 0.033 0.497 0.075 0.701 0.155 0.395 0.32 0.123 0.601 NA
Table 21: Cramer’s V measure of association between categorical variables. Associations with a medium strength (Cramer’s V > .3) are highlighted in bold text.
Terms Feature Film Drama R Rating Year Oscar Season Summer Season Best Pic Nom Best Pic Win Best Actor Best Actress Best Director Top 200
Feature Film NA 0.283 0.205 0.346 0.035 0.052 0.057 0.032 0.092 0.088 0.081 0.047
Drama 0.283 NA 0.155 0.284 0.053 0.064 0.115 0.052 0.103 0.138 0.061 0.021
R Rating 0.205 0.155 NA 0.293 0.003 0.017 0.002 0.014 0.009 0.009 0.053 0.094
Year 0.346 0.284 0.293 NA 0.283 0.229 0.247 0.242 0.251 0.237 0.298 0.256
Oscar Season 0.035 0.053 0.003 0.283 NA 0.443 0.179 0.064 0.133 0.078 0.087 0.081
Summer Season 0.052 0.064 0.017 0.229 0.443 NA 0.074 0.008 0.084 0.019 0.016 0.026
Best Pic Nom 0.057 0.115 0.002 0.247 0.179 0.074 NA 0.475 0.166 0.207 0.155 0.084
Best Pic Win 0.032 0.052 0.014 0.242 0.064 0.008 0.475 NA 0.042 0.154 0.332 0.083
Best Actor 0.092 0.103 0.009 0.251 0.133 0.084 0.166 0.042 NA 0.138 0.103 0.054
Best Actress 0.088 0.138 0.009 0.237 0.078 0.019 0.207 0.154 0.138 NA 0.085 0.077
Best Director 0.081 0.061 0.053 0.298 0.087 0.016 0.155 0.332 0.103 0.085 NA 0.041
Top 200 0.047 0.021 0.094 0.256 0.081 0.026 0.084 0.083 0.054 0.077 0.041 NA

As shown in Figure 35, the correlations among the quantitative variables were unsurprising. As expected, a strong correlation between critics scores and IMDb ratings was observed.

Terms imdb_rating critics_score runtime imdb_num_votes_log
imdb_rating 1.000 0.765 0.264 0.206
critics_score 0.765 1.000 0.170 0.073
runtime 0.264 0.170 1.000 0.290
imdb_num_votes_log 0.206 0.073 0.290 1.000

Figure 35: Correlations among quantitative predictors

4.4 EDA Summary

Whereas acknowledgements of individual achievement from the Academy showed, at most, weak associations with audience scores, films that were nominated for, or won, Best Picture were associated with higher audience scores to a statistically significant degree. It would appear that audiences prefer teamwork. Non-feature films were also associated with higher audience scores. As one might expect, high positive correlations were observed between critics scores, IMDb ratings, and audience scores. Weak positive correlations between IMDb number of votes, runtime, and audience scores were observed. Finally, audience scores saw a slight, though not statistically significant, downward trend over the time period observed. The associations between the categorical variables revealed potential collinearity among the Oscar award variables, and there was a moderate to strong correlation between critics scores and IMDb ratings.


5 Models

Here the principal question of this paper, the effect of default priors on Bayesian regression model selection, model size, posterior inclusion probabilities of regressors, and predictive performance, is investigated in four steps. First, BMA is explored within the context of each of the nine default parameter priors under consideration. Second, model performance is evaluated in terms of the predictive accuracy of the BMA, highest probability, best predictive, and median probability models for each of the default priors. Third, the parameter estimates are examined for the models that performed best on a squared-error basis. Finally, the best of the models are used to predict audience scores for a heretofore unobserved set of feature films released in 2016.

5.1 Bayesian Model Averaging

Bayesian model averaging was conducted on the full data set of 645 observations.

5.1.1 Model Spaces

Figure 36 shows the model spaces for each of the nine default parameter priors. The rows correspond to the variables and the intercept term, the names of which are indicated along the y-axis. The columns pertain to the models used in the averaging process, where the width of each column represents the posterior probability of the model: a wider column indicates a higher posterior probability, the value of which is indicated along the x-axis. The models are sorted left to right, from highest posterior probability to lowest. Variables excluded from a model are shown in black in its column. The colored areas represent variables included in the model, where the color is associated with, and proportional to, the log posterior probability of the variable. Models that share a color have similar log probabilities.

The largest models were those based upon the AIC prior, with model sizes ranging from 11 to 13 regressors for the top five models. The top five Zellner’s g-prior models included between 4 and 7 parameters. The BIC and Empirical Bayes (Global) priors generated nearly identical model spaces, ranging from 3 to 5 predictors for the top five models. The remaining model spaces, also containing between 3 and 5 predictors, were almost identical.

Figure 36: Model spaces under BMA for the nine candidate parameter priors: model prior = uniform

Figure 37: Model posterior probabilities vis-a-vis complexity

5.1.2 Parameter Posterior Inclusion Probabilities under BMA

Table 22 reports the BMA posterior inclusion probabilities for all nine prior distributions applied to the complete movie dataset. The number of predictors with an inclusion probability exceeding 50% (not including the intercept) ranged from a low of 2 (BIC, EB-Local, Hyper-g, Hyper-g/n, Hyper-g Laplace, and Zellner-Siow priors) to a high of 11 regressors for the AIC prior.
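Under BMA, the posterior inclusion probability (PIP) of a regressor is the sum of the posterior probabilities of every model that contains it. A minimal sketch over a hypothetical, enumerated model space (the variable names echo the dataset; the model probabilities are invented):

```python
def inclusion_probabilities(models, variables):
    """Posterior inclusion probability of each variable: the summed
    posterior probability of every model that includes it."""
    return {v: sum(p for included, p in models if v in included)
            for v in variables}

# Hypothetical model space: (set of included regressors, posterior probability).
model_space = [
    ({"imdb_rating", "critics_score"}, 0.50),
    ({"imdb_rating", "critics_score", "runtime"}, 0.30),
    ({"imdb_rating"}, 0.20),
]
pips = inclusion_probabilities(
    model_space, ["imdb_rating", "critics_score", "runtime"])
```

In practice the BAS package computes these quantities over the sampled model space; the sketch only illustrates the definition.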

Table 22: Posterior inclusion probabilities across parameter priors: model prior = uniform
BIC AIC EB-global EB-local g-prior hyper-g hyper-g-laplace hyper-g-n ZS-null
Intercept 1 1 1 1 1 1 1 1 1
feature_film 0.08 0.72 0.09 0.09 0.43 0.09 0.09 0.07 0.08
drama 0.04 0.54 0.05 0.05 0.31 0.05 0.06 0.04 0.05
runtime 0.48 0.91 0.51 0.5 0.75 0.5 0.5 0.45 0.48
mpaa_rating_R 0.23 0.67 0.25 0.25 0.53 0.25 0.25 0.21 0.23
thtr_rel_year 0.09 0.8 0.11 0.11 0.5 0.12 0.12 0.09 0.1
oscar_season 0.07 0.38 0.08 0.08 0.29 0.08 0.08 0.07 0.08
summer_season 0.07 0.35 0.08 0.08 0.28 0.08 0.08 0.07 0.08
imdb_rating 1 1 1 1 1 1 1 1 1
critics_score 0.9 0.97 0.91 0.9 0.92 0.9 0.9 0.89 0.9
best_pic_nom 0.13 0.67 0.15 0.15 0.48 0.16 0.16 0.12 0.14
best_pic_win 0.04 0.31 0.05 0.05 0.23 0.05 0.05 0.04 0.04
best_actor_win 0.14 0.55 0.15 0.16 0.41 0.16 0.16 0.13 0.15
best_actress_win 0.14 0.58 0.15 0.15 0.42 0.15 0.15 0.13 0.14
best_dir_win 0.07 0.4 0.07 0.08 0.29 0.08 0.08 0.06 0.07
top200_box 0.05 0.28 0.05 0.06 0.23 0.06 0.06 0.05 0.05
imdb_num_votes_log 0.1 0.85 0.12 0.13 0.55 0.13 0.13 0.1 0.11
Note:
Posterior inclusion probabilities that exceed 50% are in bold font.

Figure 38 shows the average posterior inclusion probability across all priors for each parameter. IMDb ratings, critics score, and runtime had average inclusion probabilities exceeding 50% across all priors.

Figure 38: Mean posterior inclusion probabilities across all nine priors.

5.2 Model Performance

Now, the predictive performance of the competing default priors is compared on the basis of mean squared error (MSE) on hold-out samples. Predictions were rendered using the BAS package (M. Clyde 2017) for four model estimators: (1) the BMA model, (2) the best predictive model (BPM), (3) the highest probability model (HPM), and (4) the median probability model (MPM). With nine priors and four estimators, a total of 36 models were evaluated. The movie data set was randomly split into a training set, \(D_{train}\) (80% of observations), used to train the models, and a test set, \(D_{test}\) (20% of observations), used to assess the quality of the resulting predictive distributions. This analysis was repeated over 400 different random splits, and the average MSE of the predictions for each model was computed as follows: \[\frac{1}{t}\displaystyle\sum_{j = 1}^{t}\frac{1}{n_{test}^{j}}\displaystyle\sum_{i = 1}^{n_{test}^{j}}\left(y_{i,test}^{j} - \hat{y}_{i,test}^{j}\right)^2\] where:
\(t\) is the total number of trials
\(n_{test}^{j}\) is the total number of observations in \(D_{test}^{j}\)
\(y_{i,test}^{j}\) is the observed audience score for observation \(i\) in \(D_{test}^{j}\)
\(\hat{y}_{i,test}^{j}\) is the predicted audience score for observation \(i\) in \(D_{test}^{j}\)
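The repeated-split evaluation can be sketched as follows; the "model" here is a hypothetical stand-in (predicting the training-set mean), not a BAS fit:

```python
import random

def mean_mse_over_splits(data, fit, trials=400, test_frac=0.2, seed=1):
    """Average test-set MSE over repeated random train/test splits."""
    rng = random.Random(seed)
    mses = []
    for _ in range(trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = fit(train)                      # fit a model on D_train
        mse = sum((y - predict(x)) ** 2 for x, y in test) / n_test
        mses.append(mse)
    return sum(mses) / trials

# Hypothetical 'model': always predict the training-set mean audience score.
def fit_mean(train):
    mu = sum(y for _, y in train) / len(train)
    return lambda x: mu

# Toy data: (predictor, audience score) pairs.
data = [(x, 60 + 0.2 * x) for x in range(100)]
avg_mse = mean_mse_over_splits(data, fit_mean, trials=20)
```

Swapping `fit_mean` for each prior/estimator combination yields one average MSE per model, as in Table 23.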

Table 23 shows the predictive performance of the nine parameter priors and four estimators, in conjunction with the uniform model prior, as evaluated by MSE.

Table 23: Parameter priors and predictive performance by prior and estimator (movie dataset, model prior: uniform); 100 subsamples
Prior BMA BPM HPM MPM
Bayesian Information Criteria (BIC) 102.584 103.262 103.357 101.924
Akaike Information Criterion (AIC) 101.634 102.906 103.001 99.878
Empirical Bayes (Global) 102.586 103.292 103.404 101.935
Empirical Bayes (Local) 102.57 103.253 103.543 101.954
Zellner’s g-prior 103.116 104.355 105.179 101.683
Hyper-g 102.566 103.255 103.503 101.944
Hyper-g Laplace 102.562 103.252 103.473 101.934
Hyper-g-n 102.653 103.405 103.641 102.141
Zellner-Siow (NULL) 102.59 103.279 103.47 101.972
Note:
The five lowest MSE scores are in bold font. The lowest MSE is in red font.

The top five scores are highlighted in bold and the lowest average MSE is noted in red font. The Akaike Information Criterion (AIC) prior using the MPM estimator outperformed the other 35 models, though not decisively. To ascertain the significance of the differences in mean MSE among the models, t-tests of the MSE means were conducted at an \(\alpha = .05\) significance level. Table 24 reports the top four models in order of ascending mean MSE, along with the p.value for the probability that \(\mu_m = \mu_{best}\), where \(\mu_m\) is the mean MSE for model \(m\) and \(\mu_{best}\) is the lowest mean MSE among all models and estimators. As indicated by the p.values, the differences in mean MSE, while small, were statistically significant. The box plot of the distribution of MSE scores in Figure 39, however, reveals substantial overlap in the distributions of MSE, supporting the notion that each of the four models should be given some consideration.
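The comparison of mean MSE between two models can be sketched with Welch's two-sample t statistic (the per-trial MSE samples below are invented; converting t to a p.value requires the t distribution, omitted here):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for unequal variances."""
    va, vb = variance(a), variance(b)        # sample variances
    se = sqrt(va / len(a) + vb / len(b))     # standard error of the difference
    return (mean(a) - mean(b)) / se

# Hypothetical per-trial MSE samples for two competing models.
t = welch_t([99.1, 100.4, 99.9, 100.2], [101.5, 102.0, 101.2, 102.3])
```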

Table 24: Mean MSE by prior and estimator for top four models with p.values for the probability that \(\mu_m = \mu_{best}\), where \(\mu_m\) is the mean MSE for model \(m\) and \(\mu_{best}\) is the lowest MSE among all models and estimators.
Prior Estimator Mean MSE p.value
Akaike Information Criterion (AIC) MPM 99.88 1.00
Akaike Information Criterion (AIC) BMA 101.63 0.03
Zellner’s g-prior MPM 101.68 0.02
Bayesian Information Criteria (BIC) MPM 101.92 0.01

Figure 39: Distribution of MSE among top 4 models

5.3 Parameter Estimates

As indicated in Table 25, there was some variation among the selected predictors as well as their estimates. Markedly, all four models produced precisely the same estimate for the intercept, although with slightly different levels of uncertainty. The single greatest positive influence on audience scores, IMDb rating, would generate between 14 and 15 audience score points for each point on the IMDb rating scale. All estimates had similar variation, with \(SD \approx 0.6\). Like IMDb rating, critics score was a positive factor in all four models, but with far less influence on audience scores: a point on the critics score scale would result in an estimated 0.07 points on the audience score scale. There was a slight negative association between runtime and audience scores (an approximate 0.07-point drop for each minute of runtime) in the AIC (MPM) and g-prior (MPM) models; runtime was not a factor in the BIC (MPM) model. An MPAA R rating wasn’t necessarily good for a film’s audience score: the AIC (MPM) and g-prior (MPM) models associated an R rating with a drop of approximately 1.5 points. The Oscar performance of a film was a relatively significant predictor in the AIC (MPM) model, but the devil is in the details. Only Best Picture nominations were positively associated with audience scores, with a relatively high estimate of approximately 4 points; individual performance, however, was not rewarded. Best Actor and Best Actress winners were negatively associated with audience scores. It would appear that audiences prefer ensembles.

Table 25: Coefficient Estimates by Prior and Estimator
Terms Mean SD 2.5% 97.5%
AIC (MPM)
Intercept 62.3907 0.3913 61.6222 63.1592
feature_filmyes -4.8894 2.0606 -8.9358 -0.8430
dramayes 1.6979 0.8961 -0.0619 3.4576
runtime -0.0662 0.0232 -0.1118 -0.0207
mpaa_rating_Ryes -1.4881 0.8101 -3.0789 0.1026
thtr_rel_year -0.1013 0.0397 -0.1793 -0.0233
imdb_rating 14.2430 0.6231 13.0194 15.4665
critics_score 0.0624 0.0219 0.0193 0.1055
best_pic_nomyes 4.1994 2.3364 -0.3886 8.7874
best_actor_winyes -1.8900 1.1683 -4.1841 0.4042
best_actress_winyes -2.2362 1.3050 -4.7990 0.3265
imdb_num_votes_log 0.6652 0.2192 0.2347 1.0957
Terms Mean SD 2.5% 97.5%
AIC (BMA)
Intercept 62.3907 0.3936 61.6342 63.1461
feature_filmyes -3.0951 2.7581 -8.2560 0.0070
dramayes 0.7515 0.9893 -0.1756 3.0441
runtime -0.0554 0.0289 -0.1000 0.0000
mpaa_rating_Ryes -1.0027 0.9733 -2.8649 0.0006
thtr_rel_year -0.0729 0.0534 -0.1625 0.0000
oscar_seasonyes -0.3496 0.7349 -2.3544 0.4259
summer_seasonyes 0.2694 0.6540 -0.5826 2.0751
imdb_rating 14.5169 0.6935 13.2281 15.9119
critics_score 0.0644 0.0248 0.0160 0.1167
best_pic_nomyes 2.9877 2.9291 -0.0289 8.7413
best_pic_winyes -0.5984 2.7371 -8.2414 4.3568
best_actor_winyes -1.0102 1.2676 -3.7747 0.2247
best_actress_winyes -1.2409 1.4562 -4.2942 0.1893
best_dir_winyes -0.7120 1.3876 -4.2047 0.9884
top200_boxyes 0.2480 1.5317 -3.1588 4.1925
imdb_num_votes_log 0.4712 0.3087 0.0000 0.9858
Terms Mean SD 2.5% 97.5%
g-prior (MPM)
Intercept 62.3907 0.3943 61.6165 63.1649
runtime -0.0688 0.0220 -0.1120 -0.0257
mpaa_rating_Ryes -1.6788 0.7914 -3.2328 -0.1249
thtr_rel_year -0.0676 0.0374 -0.1411 0.0059
imdb_rating 14.7885 0.5914 13.6271 15.9498
critics_score 0.0721 0.0218 0.0293 0.1149
imdb_num_votes_log 0.3787 0.1803 0.0248 0.7327
Terms Mean SD 2.5% 97.5%
BIC (MPM)
Intercept 62.3907 0.3981 61.6089 63.1725
imdb_rating 14.6058 0.5737 13.4792 15.7324
critics_score 0.0746 0.0218 0.0319 0.1174

5.4 Top Model Evaluations

Having explored prediction accuracy and characterized the parameter estimates in terms of their effects on audience score, the focus turned to the quality of those model inferences. Specifically, the distributions of errors vis-a-vis predictions were evaluated for homoscedasticity. Figure 40 presents very similar distributions of errors vis-a-vis fitted values for each of the nine priors. A greater concentration of residuals was observed at higher audience scores, whereas the residuals for lower scores were more dispersed. Also, the quality of the predictions tended to be positively associated with the quantity of observations in any particular range of audience scores.

5.4.1 Residuals vs Fitted

Figure 40: Residual vs Fitted Values

Though the residuals appeared to center at zero, the clear lack of homoscedasticity among the models could not be ignored. As such, the quality and reliability of the above inferences, which depend upon \(\epsilon \sim N(0, \sigma^2)\), are suspect. That said, the overarching objective was to explore the relative prediction accuracy of the various default parameter priors and estimators. It was therefore concluded that while inference using any of the nine priors and four estimators would be suspect, the 36 models were on “equal footing” with respect to prediction. Hence all four top models would advance to the prediction phase.


6 Prediction

Five films from 2016 were selected from the BoxOfficeMojo.com, IMDb, and Rotten Tomatoes websites and predictions were rendered using the four “best” models from the model evaluation section:
* Akaike Information Criterion (Median Probability Model)
* Zellner’s g-prior (Median Probability Model)
* Empirical Bayes-Global (Median Probability Model)
* Hyper-g Laplace (Median Probability Model)

6.1 Prediction Data

The five movies selected for prediction (Table 26) were chosen to present a range of values for audience score and predictor values across the parameter space. Audience scores ranged from 34 to 87. The number of IMDb votes varied from approximately 8,000 votes for Synchronicity to over 460,000 for Suicide Squad. Similarly, critics scores stretched from a low of 26 points to a near-perfect 98. Results at the Academy varied significantly among the chosen films in order to provide a reasonably diverse sample.

Table 26: Movies selected for prediction
Title Feature Drama Run R Year Oscar Summer IMDb Rating Critics BP Nom BP Win Actor Actress Dir Top # Votes Audience
Split yes no 117 no 2016 no no 7.3 75 no no no no no no 252594 78
Moonlight yes yes 111 yes 2016 yes no 7.5 98 yes yes no no no no 184405 79
Rogue One: A Star Wars Story yes no 133 no 2016 yes no 7.8 85 no no no no no yes 411283 87
Synchronicity yes yes 101 yes 2016 no no 5.5 46 no no no no no no 8176 34
Suicide Squad yes no 130 no 2016 no yes 6.1 26 no no no no no yes 466363 60

6.2 Prediction Results

Table 27 summarizes the predictions, true values, and mean squared errors for each film and model. Mean MSE across the four models and five films ranged between 6.4 and 6.7, with Zellner’s g-prior performing best. All models under-predicted the audience score for Split, a ‘middle-of-the-pack’ film with no Academy awards and with critics scores and IMDb ratings in the 70th percentile. By contrast, all models over-predicted the audience score for Moonlight, a film enjoyed by the critics and the Academy, but with the second-lowest number of votes. All models also under-predicted the audience score for Rogue One, a blockbuster film with over 400,000 votes. The largest error among all the models was the significant over-prediction for Synchronicity, a relatively unknown outlier, as indicated by its number of votes, which was panned by the critics. Predictions for Suicide Squad, the most frequently rated of the five, were significantly lower than the true audience score.

Table 27: Prediction Results
Film AIC EB-global g-prior hyper-g-laplace Audience Score
Split 75.2 75.4 75.8 75.4 78
Moonlight 83.4 80.1 79.2 80.1 79
Rogue One: A Star Wars Story 82.5 83.5 83.2 83.5 87
Synchronicity 45.7 47 46.2 47 34
Suicide Squad 53.4 54.3 53.8 54.3 60
MSE 6.7 6.7 6.4 6.7

A common thread through all the predictions involves the estimate for the IMDb number of votes (log) parameter. The models under-predicted audience scores for films with the highest numbers of votes and over-predicted scores for those with few votes. A log transformation was applied to this variable to address significant right skew. Future studies might experiment with more sophisticated transformations or higher-order polynomials to more precisely characterize the relationship between votes and audience scores.

7 Conclusion

7.1 Default Priors and Predictive Performance

This paper documents the exploration of default priors and their effect on predictive accuracy in Bayesian model averaging. Predictive accuracy was evaluated for the BIC and AIC model selection techniques, Zellner’s g-prior, local and global empirical Bayes, as well as the g-prior mixtures: the Zellner-Siow, Hyper-g, Hyper-g Laplace, and Hyper-g/n priors. BMA, highest probability, best predictive, and median probability models were fit for each prior, for a space of 36 separate models. A data set of 645 observations was randomly split into a training set (80%), which was used to develop the models, and a test set (20%), which was used to evaluate the predictive accuracy of each model on a mean squared error basis. This process was repeated for 400 trials; MSE was recorded and averages were computed for each prior and estimator. The AIC prior generated predictions with the lowest mean MSE of the 36 models, followed by Zellner’s g-prior, global empirical Bayes, and Hyper-g using the Laplace approximation. In each case, the median probability model estimator outperformed BMA, the highest probability model, and the best predictive model. That said, the differences in mean MSE were not large enough to make any conclusive statements with respect to the relative predictive accuracy of these four priors.


References

Amini, Shahram M, and Christopher F Parmeter. n.d. “Bayesian Model Averaging in R.” https://core.ac.uk/download/pdf/6494889.pdf.

Bottolo, Leonard, and Sylvia Richardson. 2010. “Evolutionary Stochastic Search for Bayesian Model Exploration.” Bayesian Analysis 5 (3): 583–618. doi:10.1214/10-BA523.

Clyde, Merlise. 2017. “Bayesian Variable Selection and Model Averaging using Bayesian Adaptive Sampling.” https://github.com/merliseclyde/BAS.

Draper, David. 1995. “Assessment and Propagation of Model Uncertainty.” J.R. Statist. Soc. B 57 (109): 45–97. https://classes.soe.ucsc.edu/ams206/Winter11/draper-1995-model-uncertainty.pdf.

Eicher, Theo S, Chris Papageorgiou, and Adrian E Raftery. 2011. “Default Priors and Predictive Performance in Bayesian Model Averaging, with Application to Growth Determinants.” Journal of Applied Econometrics 26: 30–55. doi:10.1002/jae.1112.

Fernández, Carmen, Eduardo Ley, and Mark F J Steel. 2001. “Benchmark Priors for Bayesian Model Averaging.” Journal of Econometrics 100: 381–427. https://eclass.aueb.gr/modules/document/file.php/OIK164/Fernandez Ley and Steel JE 2001.pdf.

Flixter. 2017. “Rotten Tomatoes: Movies | TV Shows | Movie Trailers | Reviews.” Accessed November 24. http://www.rottentomatoes.com/.

Fritz, Ben. 2008. “Box Office Mojo.” http://www.boxofficemojo.com/.

Hodges, James S. 1987. “Uncertainty, Policy Analysis and Statistics.” Statistical Science 2 (3): 259–91. https://projecteuclid.org/download/pdf_1/euclid.ss/1177013224.

Hoel, Paul G., J. Wolfowitz, and Harald Cramer. 1947. Mathematical Methods of Statistics. Vol. 42. 237. doi:10.2307/2280199.

Ley, Eduardo, and Mark F J Steel. 2008. “On the Effect of Prior Assumptions in Bayesian Model Averaging with Applications to Growth Regression.” MPRA Paper 6773. http://mpra.ub.uni-muenchen.de/6773/.

Liang, Feng, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger. 2008. “Mixtures of g Priors for Bayesian Variable Selection.” Journal of the American Statistical Association 103 (481). Taylor & Francis: 410–23. doi:10.1198/016214507000001337.

Maruyama, Yuzo, and Edward I George. 2011. “Fully Bayes Factors with a Generalized g-Prior.” The Annals of Statistics 39 (5): 2740–65. doi:10.1214/11-AOS917.

Needham, Col. 1990. “IMDb - Movies, TV and Celebrities.” http://www.imdb.com/.

Raftery, Adrian E, David Madigan, and Jennifer A Hoeting. 1997. “Bayesian Model Averaging for Linear Regression Models.” Journal of the American Statistical Association 92 (437): 179–91. https://www.stat.washington.edu/raftery/Research/PDF/rmh1997.pdf.

Raftery, Adrian E. 1988. “Inference for the Binomial N Parameter: A Hierarchical Bayes Approach.” Biometrika 75 (2): 223–28. doi:10.1093/biomet/75.2.223.

Raftery, Adrian, David Madigan, and Jennifer Hoeting. 1994. “Model Selection and Accounting for Model Uncertainty in Linear Regression Models.” Journal of the American Statistical Association 89 (428): 1535–46. https://www.stat.washington.edu/raftery/Research/PDF/madigan1994.pdf.

Wasserman, Larry, and Robert E. Kass. 1996. “The Selection of Prior Distributions by Formal Rules.” Journal of the American Statistical Association 91 (435): 1343–70. http://mathfaculty.fullerton.edu/sbehseta/KassWasserman-JASA-1996.pdf.

Zellner, A, and A Siow. 1980. “Posterior Odds Ratios for Selected Regression Hypotheses.” Trabajos de Estadistica Y de Investigacion Operativa 31 (1): 585–603. doi:10.1007/BF02888369.

Zellner, Arnold, and Brent R Moulton. 1985. “Bayesian Regression Diagnostics with Applications to International Consumption and Income Data.” Journal of Econometrics 29 (1): 187–211. doi:10.1016/0304-4076(85)90039-9.

John James jjames@datasciencesalon.org

06 March, 2018